DateFinder: detecting date regions on handwritten document images based on positional expectancy
Author
Abstract
While Optical Character Recognition (OCR) technology is used for many kinds of documents, such as cheques, passports, bank statements and receipts, there is growing interest in modelling the occurrence and location of visual items. However, this source of information attracts much less attention in general OCR. For instance, many specific visual items (e.g., dates, writer signatures, calligraphy, author markings, schematic drawings, glyphs and even graffiti) can be used to explain the underlying meaning and origin of documents. Among these visual items, dates play a very important role in many documents (e.g., bank cheques, letters, postal mail, bills and diaries), where they provide time-related clues for readers. The date is also central to many administrative applications such as document indexing, translation and retrieval. In this thesis, we propose a method called DateFinder for detecting date regions on handwritten document images using a four-step processing sequence. First, we perform pre-processing operations on the original scanned images to extract candidate date blocks. Second, a positional expectancy model for 'date' text blocks measures how similar an unknown region is to a date region based on its position. Third, feature representation and classification techniques are used to extract features from each extracted block and compute the probability that the block is a date region. Finally, we combine the scores of the positional expectancy model and the classifier to determine whether an extracted block is a date region. In our experiments, we obtained encouraging results for detecting date regions in our dataset. However, there are still ample opportunities to improve the proposed DateFinder method, which can be considered in future work.
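The second and fourth steps of the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the positional expectancy model is built or how the two scores are combined, so everything here is an assumption made for illustration: the grid histogram over normalised block centres, the weighted-sum fusion, and the names `Block`, `PositionalExpectancy`, `combined_score` and `is_date_region` are all hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Candidate text block; centre given in page-normalised coordinates (0..1)."""
    x: float  # centre x divided by page width
    y: float  # centre y divided by page height


class PositionalExpectancy:
    """Assumed model: a grid histogram of where date blocks occurred in training pages."""

    def __init__(self, bins: int = 10):
        self.bins = bins
        self.counts = [[0] * bins for _ in range(bins)]
        self.peak = 0

    def _cell(self, block: Block):
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        col = min(int(block.x * self.bins), self.bins - 1)
        row = min(int(block.y * self.bins), self.bins - 1)
        return row, col

    def fit(self, date_blocks):
        """Count labelled date blocks per grid cell."""
        for block in date_blocks:
            row, col = self._cell(block)
            self.counts[row][col] += 1
        self.peak = max(max(row) for row in self.counts)
        return self

    def score(self, block: Block) -> float:
        """Score in [0, 1]: cell count relative to the most frequent cell."""
        if self.peak == 0:
            return 0.0
        row, col = self._cell(block)
        return self.counts[row][col] / self.peak


def combined_score(pos: float, clf: float, weight: float = 0.5) -> float:
    """Assumed fusion rule: weighted sum of position score and classifier probability."""
    return weight * pos + (1.0 - weight) * clf


def is_date_region(pos: float, clf: float, weight: float = 0.5,
                   threshold: float = 0.5) -> bool:
    """Accept the block when the fused score reaches an (assumed) threshold."""
    return combined_score(pos, clf, weight) >= threshold
```

In this sketch, a block in a cell where training dates were frequent can be accepted even with a moderate classifier probability, while a block in a cell that never held a date needs a very confident classifier. Other fusion rules, such as a product of the two scores, would fit the abstract's description equally well.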
Similar resources
Connected Component Based Word Spotting on Persian Handwritten image documents
Word spotting makes unindexed image documents searchable by locating a word or words in a document image, given a query word. This problem is challenging, mainly due to the large number of word classes with very small inter-class and substantial intra-class distances. In this paper, a segmentation-based word spotting method is presented for multi-writer Persian handwritten doc-...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
Learning Document Image Features With SqueezeNet Convolutional Neural Network
The classification of various document images is considered an important step towards building a modern digital library or office automation system. Convolutional Neural Network (CNN) classifiers trained with backpropagation are considered to be the current state of the art model for this task. However, there are two major drawbacks for these classifiers: the huge computational power demand for...
Matching Handwritten Document Images
We address the problem of predicting similarity between a pair of handwritten document images written by potentially different individuals. This has applications related to matching and mining in image collections containing handwritten content. A similarity score is computed by detecting patterns of text re-usages between document images irrespective of the minor variations in word morphology,...
Morphology Based Handwritten Line Segmentation Using Foreground and Background Information
Text line segmentation is currently an important stage of research in historical document processing. Because of variability in inter-line distance and base-line skew, line segmentation in unconstrained handwritten documents is very difficult. The task becomes more complicated when overlap or inter-penetration occurs between two consecutive text lines. In this p...